fix(query-language): preserve grouped filters in regex mode #1138

Merged
brendan-kellam merged 2 commits into sourcebot-dev:main from brianphillips:paren-grouping-parsing
Apr 20, 2026

Conversation

@brianphillips
Contributor

brianphillips commented on Apr 20, 2026

Summary

  • preserve query grouping in regex mode when parentheses wrap query syntax instead of raw regex atoms
  • keep bare regex groups like (foo|bar) parsed as single regex terms
  • add regression coverage for grouped prefix filters and for parenthesized regex OR groups behind filters
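The distinction in the first two bullets can be sketched as a standalone heuristic. This is an illustrative simplification, not the tokenizer's actual logic; the function name and prefix list are hypothetical:

```typescript
// Hypothetical sketch: decide whether a parenthesized body should be treated
// as query grouping rather than a raw regex atom. The real tokenizer works on
// stream offsets and a fuller prefix set; this only mirrors the idea.
const QUERY_PREFIXES = ["repo:", "file:", "lang:", "rev:", "case:"];

function looksLikeQueryGrouping(body: string): boolean {
    const trimmed = body.trim();
    // Multiple whitespace-separated parts (e.g. "repo:a or repo:b")
    // read as query grouping, not a single regex atom.
    if (trimmed.split(/\s+/).length > 1) {
        return true;
    }
    // A single bare token like "foo|bar" stays a regex atom unless it
    // starts with a known filter prefix, a negation, or a quote.
    return (
        QUERY_PREFIXES.some((p) => trimmed.startsWith(p)) ||
        trimmed.startsWith("-") ||
        trimmed.startsWith('"')
    );
}

console.log(looksLikeQueryGrouping("foo|bar"));          // false: bare regex group
console.log(looksLikeQueryGrouping("repo:a or repo:b")); // true: query grouping
```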

Testing

  • ../../node_modules/.bin/vitest --run (from packages/queryLanguage)

🤖 Agent Plan

  1. Trace how isRegexEnabled changes query parsing and identify where field-specific filters are converted into regex queries.
  2. Reproduce the failing parse shape and confirm regex mode is treating grouped filter parentheses as part of a regex term instead of query grouping.
  3. Update the query-language tokenizer so regex mode distinguishes raw regex parentheses from query-like parenthesized groups, including negated groups.
  4. Add regression tests for grouped prefix filters, grouped regex OR expressions behind filters, and the nested filter plus (regex or regex) grouping discussed in this thread.
  5. Run the focused packages/queryLanguage Vitest suite and verify the failing queries parse correctly.
  6. Exclude unrelated repository state changes, including the vendor/zoekt submodule rewind, from the PR.

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Corrected regex mode search parsing to properly handle parenthesized filter groups. Query-style grouping syntax now works correctly when regex mode is enabled, allowing users to combine multiple filtered conditions with regex patterns for more precise and complex searches.

@coderabbitai
Contributor

coderabbitai Bot commented Apr 20, 2026

Walkthrough

The PR fixes regex mode query parsing by introducing offset-aware helpers and isRegexQueryGroupingAt() to distinguish query-grouping parentheses from regex atom parentheses. The tokenizer now selectively emits parenthesis tokens and refines negation logic based on whether parenthesized content represents query grouping syntax.

Changes

Documentation & Comments (CHANGELOG.md, packages/web/src/features/search/parser.ts):
Added an entry documenting the regex search parsing fix and updated comments to clarify the distinction between bare regex parentheses and query-like parenthesized expressions.

Tokenizer Logic (packages/queryLanguage/src/tokens.ts):
Added offset-aware helpers (matchesStringAt, startsWithPrefixAt, skipWhitespace, isEscapedAt), introduced findMatchingCloseParenOffset(), findMatchingOpenParenOffset(), and isRegexQueryGroupingAt() for context-aware paren matching. Updated parenToken, closeParenToken, and wordToken behavior in Dialect_regex to emit parens only when they represent query grouping; simplified the negateToken condition.

Test Coverage (packages/queryLanguage/test/regex.txt):
Added four new test cases covering parenthesized prefix/filter groups with the or keyword, verifying correct AST generation for combinations like ParenExpr(OrExpr(...)) and NegateExpr(ParenExpr(OrExpr(...))).
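The close-paren matcher named above can be approximated over a plain string. The real helper operates on Lezer InputStream offsets; this is an illustrative sketch of the escape-aware, depth-counting forward scan, not the actual implementation:

```typescript
// Sketch of forward scanning for the paren that closes input[openIndex],
// counting nesting depth and skipping backslash-escaped characters so that
// regex literals like "\(" and "\)" do not affect the depth.
function findMatchingCloseParen(input: string, openIndex: number): number | null {
    if (input[openIndex] !== "(") {
        return null;
    }
    let depth = 0;
    for (let i = openIndex; i < input.length; i++) {
        const ch = input[i];
        if (ch === "\\") {
            i++; // the escaped character is a literal, not a paren
            continue;
        }
        if (ch === "(") {
            depth++;
        } else if (ch === ")") {
            depth--;
            if (depth === 0) return i;
        }
    }
    return null; // unbalanced: no matching close paren
}

console.log(findMatchingCloseParen("(a(b)c)", 0)); // 6: outermost close paren
```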

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes


🚥 Pre-merge checks (3 passed)
  • Description check: ✅ Passed. Skipped because CodeRabbit’s high-level summary is enabled.
  • Title check: ✅ Passed. The title clearly and specifically describes the main fix: preserving grouped filters in regex mode, the core purpose of the changeset.
  • Docstring coverage: ✅ Passed. Coverage is 100.00%, above the required 80.00% threshold.



@coderabbitai
Contributor

coderabbitai Bot left a comment


🧹 Nitpick comments (3)
packages/queryLanguage/src/tokens.ts (2)

198-298: Consider early-exiting once we know the inside isn't a single token.

isRegexQueryGroupingAt enumerates every top-level token into topLevelTokens, but the only facts that matter afterwards are (a) whether there is exactly one top-level token, and (b) the starting offset of that token (for firstCh / prefix / -/"/( decisions). For a long paren body that already contains two or more top-level tokens, you can return true as soon as the second one is seen instead of finishing the scan.

♻️ Suggested simplification
-    const topLevelTokens: Array<{ start: number; end: number }> = [];
-
-    while (offset < closeOffset) {
-        const tokenStart = offset;
-        ...
-        topLevelTokens.push({ start: tokenStart, end: offset });
-        offset = skipWhitespace(input, offset);
-    }
-
-    if (topLevelTokens.length !== 1) {
-        return true;
-    }
-
-    const [{ start, end }] = topLevelTokens;
+    // Scan the first top-level token and bail out to "grouping" as soon as
+    // we discover a second one — we don't need the token list.
+    let firstTokenStart = -1;
+    let firstTokenEnd = -1;
+    let seenTokens = 0;
+    while (offset < closeOffset) {
+        const tokenStart = offset;
+        // ...same inner scan as today...
+        if (seenTokens === 0) {
+            firstTokenStart = tokenStart;
+            firstTokenEnd = offset;
+        }
+        seenTokens++;
+        if (seenTokens > 1) {
+            return true;
+        }
+        offset = skipWhitespace(input, offset);
+    }
+    if (seenTokens === 0) {
+        return true; // whitespace-only body
+    }
+    const start = firstTokenStart;
+    const end = firstTokenEnd;

Not a correctness issue — purely a small readability / worst-case improvement for deeply parenthesized regex bodies.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/queryLanguage/src/tokens.ts` around lines 198 - 298, The function
isRegexQueryGroupingAt currently collects all top-level tokens into
topLevelTokens but only needs to know if there's exactly one and the start of
that one; change the loop so it tracks the first token's start (e.g.,
firstTokenStart) and a tokenCount instead of pushing into topLevelTokens, and as
soon as tokenCount becomes 2 return true immediately (early-exit) to avoid
scanning the rest of the paren body; keep the same logic afterward that uses the
single token's start (firstTokenStart) and end to check startsWithPrefixAt, '-'
handling, QUOTE, and nested OPEN_PAREN recursion.

300-335: Consider bounding the backward scan in findMatchingOpenParenOffset.

Unlike hasUnmatchedOpenParen (capped at 1000 chars backward), this helper walks the input backward without any upper bound. That wouldn't be a problem on its own, but in regex-mode wordToken it is invoked on every ) encountered while consuming a word (lines 536–541), and it is also the first thing closeParenToken does in regex mode. On pathological inputs (e.g., a long regex word containing many ) characters with no matching (), each ) triggers a full backward scan of everything consumed so far — an easy path to O(n²) tokenization.

Two options:

  1. Match the 1000-char safety limit used in hasUnmatchedOpenParen so the scan is bounded. This is the smallest change and keeps behavior consistent within this file.
  2. Better: memoize the result (or keep a running paren stack) in the regex-mode wordToken loop so the information is computed once up front instead of recomputed per ).
♻️ Minimal bounded-scan fix
 function findMatchingOpenParenOffset(input: InputStream): number | null {
     if (input.next !== CLOSE_PAREN) {
         return null;
     }

     let offset = -1;
     let depth = 1;

-    while (true) {
+    const MIN_OFFSET = -1000;
+    while (offset >= MIN_OFFSET) {
         const ch = input.peek(offset);
         ...
     }
+    return null;
 }
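The second option above (a running paren stack computed once up front) could look roughly like the following. This is a sketch over a plain string under the same assumptions as the review comment, not code from the PR:

```typescript
// Precompute matching paren offsets in one left-to-right pass so that each
// ")" lookup during tokenization is O(1) instead of a backward rescan.
// Backslash-escaped characters are skipped so "\(" and "\)" stay literals.
function matchParens(input: string): Map<number, number> {
    const closeToOpen = new Map<number, number>(); // ")" index -> "(" index
    const stack: number[] = [];
    for (let i = 0; i < input.length; i++) {
        const ch = input[i];
        if (ch === "\\") {
            i++; // skip the escaped character
            continue;
        }
        if (ch === "(") {
            stack.push(i);
        } else if (ch === ")" && stack.length > 0) {
            closeToOpen.set(i, stack.pop()!);
        }
    }
    return closeToOpen; // ")" positions without an entry are unmatched
}

console.log(matchParens("a(b)c").get(3)); // 1: the "(" matching the ")" at index 3
```

Adapting this to an incremental InputStream would need care around buffering, which is why the bounded-scan fix above is the smaller change.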
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/queryLanguage/src/tokens.ts` around lines 300 - 335, The backward
scan in findMatchingOpenParenOffset can run unbounded and cause O(n²) behavior;
bound the scan like hasUnmatchedOpenParen by stopping after a fixed limit (e.g.,
MAX_UNMATCHED_SCAN = 1000) and returning null if exceeded. Edit
findMatchingOpenParenOffset to track how many positions it has moved (or check
offset <= -MAX_UNMATCHED_SCAN) and break/return null when the limit is reached;
keep the rest of the escape and depth logic unchanged so callers like wordToken
and closeParenToken get a safe, bounded result.
packages/queryLanguage/test/regex.txt (1)

56-74: Good regression coverage; minor inconsistency worth noting.

The new fixtures line up with the grammar (ParenExpr { openParen query? closeParen }, NegateExpr { negate (PrefixExpr | ParenExpr) }) and exercise the interesting regex-mode grouping paths (top-level OR group, AND with prefix + group, multi-filter AND + group with internal spaces, negated group).

Only nit: the repo: regex on line 62 encodes - as \x2d while line 67 uses a plain -. Both are equivalent regex atoms, and both should travel through the tokenizer as part of the same Term/prefix word, so the divergence isn't strictly needed — unless the intent was explicitly to cover the hex-escape path. Consider either aligning on plain - or adding a one-line comment above line 62 stating that \x2d is there to exercise hex escapes inside a regex prefix value.
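For reference, the equivalence the nit relies on (the fixture repo patterns themselves are quoted from the test file; the check below is only a demonstration):

```typescript
// \x2d is the hex escape for "-", so both patterns denote the same atom
// and match the same repository name.
const hexEscaped = new RegExp("^github\\.com/sourcebot\\x2ddev/sourcebot$");
const plain = new RegExp("^github\\.com/sourcebot-dev/sourcebot$");
const name = "github.com/sourcebot-dev/sourcebot";
console.log(hexEscaped.test(name), plain.test(name)); // true true
```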

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/queryLanguage/test/regex.txt` around lines 56 - 74, The fixture
using repo:^github\.com/sourcebot\x2ddev/sourcebot$ is inconsistent with the
later fixture that uses a plain hyphen; either change the hex-escaped \x2d to a
plain - so both regex-prefix fixtures use the same form, or add a one-line
comment above the repo:^github\.com/sourcebot\x2ddev/sourcebot$ fixture stating
that \x2d is intentionally used to exercise hex-escape handling in the tokenizer
(the surrounding grammar symbols to locate the case are the repo: prefix and the
ParenExpr/NegateExpr fixtures such as
Program(AndExpr(PrefixExpr(RepoExpr),ParenExpr(...)))).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 10c94d85-06a4-4244-80bb-961710e07771

📥 Commits

Reviewing files that changed from the base of the PR and between b2941c4 and 5dbf6b2.

📒 Files selected for processing (4)
  • CHANGELOG.md
  • packages/queryLanguage/src/tokens.ts
  • packages/queryLanguage/test/regex.txt
  • packages/web/src/features/search/parser.ts

@brendan-kellam
Contributor

brendan-kellam left a comment


LGTM thx!

brendan-kellam merged commit 8b56ea8 into sourcebot-dev:main on Apr 20, 2026
6 checks passed

2 participants